Introduction


Introduction to Deep Learning 


What & Why 
History of Artificial Intelligence


• First wave: cybernetics, programmable computers, Turing test, cognitive science, functionalism
• Herbert Simon: "Within the next 20 years, machines will be able [to do] all things a human is able to [...] do."
• First AI winter: the US and British governments reduce the funding for AI because expectations were not met.
• Second wave: the Pentagon and the Japanese government invest billions in AI projects.
• Second AI winter: the Department of Defense invested more than a billion US dollars in the Strategic Computing Initiative (SCI). The Japanese Ministry of International Trade and Industry (MITI) stops the Fifth Generation Computer Systems project.
• Deep Blue wins against the reigning champion in chess.
• Tim Berners-Lee (2001) triggers new standards for the Semantic Web and related technologies.
• Big Data promises to allow hypotheses to be generated out of massive amounts of data.


Success Stories of Deep Learning 


Unsupervised high-level feature learning 
• Using a deep network of 1 billion parameters, 10 million images (sampled from YouTube), 1000 machines (16,000 cores) x 1 week.
• Evaluation
  ◦ ImageNet data set (20,000 categories)
  ◦ 0.005% random guessing
  ◦ 9.5% state-of-the-art
  ◦ 16.1% for deep architecture
  ◦ 19.2% including pre-training

https://research.google.com/archive/unsupervised_icml2012.html


• Primarily on speech recognition and images
• Interest by the big players
  ◦ Facebook
    - Face recognition
    - https://research.facebook.com/publications/480567225376225/deepface-closing-the-gap-to-human-level-performance-in-face-verii
  ◦ Baidu
    - Speech recognition
    - https://gigaom.com/2014/12/18/baidu-claims-deep-learning-breakthrough-with-deep-speech/
  ◦ Microsoft
    - Deep learning technology centre
    - e.g. NLP: Deep Semantic Similarity Model
    - http://research.microsoft.com/en-us/projects/dssm/


Prerequisite Knowledge 


• Neural Networks
  ◦ Backpropagation
  ◦ Recurrent neural network (good for time series, NLP)
• Optimization
  ◦ Generalisation (over-fitting), regularisation, early stopping
  ◦ Logistic sigmoid, (stochastic) gradient descent
• Hyperparameters
  ◦ Number of layers, size of e.g. mini-batches, learning rate, ...
  ◦ Grid search, manual search, a.k.a. Graduate Student Descent (GSD)
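A minimal sketch of grid search over two hyperparameters (learning rate and number of epochs), using a one-dimensional logistic regression trained with stochastic gradient descent. The toy data, the grid values and all function names are assumptions made for illustration only.

```python
import itertools
import math
import random

def sigmoid(z):
    # Numerically stable logistic sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(data, learning_rate, epochs, seed=0):
    """Fit a 1-D logistic regression with plain SGD; return (w, b)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            error = sigmoid(w * x + b) - y  # gradient of the log loss w.r.t. the pre-activation
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b

def log_loss(data, w, b):
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

# Toy data (made up): the label is 1 when x > 0.5.
rng = random.Random(42)
points = [rng.uniform(0.0, 1.0) for _ in range(200)]
dataset = [(x, 1 if x > 0.5 else 0) for x in points]
train, valid = dataset[:150], dataset[150:]

# Try every combination in the grid and keep the one with the lowest validation loss.
grid = {"learning_rate": [0.01, 0.1, 1.0], "epochs": [5, 50]}
results = []
for lr, epochs in itertools.product(grid["learning_rate"], grid["epochs"]):
    w, b = train_logistic(train, lr, epochs)
    results.append((log_loss(valid, w, b), lr, epochs))

best_loss, best_lr, best_epochs = min(results)
print(f"best: lr={best_lr}, epochs={best_epochs}, validation loss={best_loss:.3f}")
```

The cost grows with the product of the grid sizes, which is one reason manual search remains common in practice.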


Neural Network Properties 


• 1-layer networks can only solve linearly separable problems (the decision boundary is a hyperplane)
• 2-layer networks with a non-linear activation function can approximate any continuous function (given an arbitrarily large number of hidden neurons)
• With more than 2 layers, fewer nodes are needed; therefore one wants to have deep neural networks
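The classic illustration of the first two points is XOR. The sketch below uses threshold units and hand-picked weights (both are assumptions chosen for illustration): a brute-force search over a coarse weight grid finds no single unit that computes XOR, while a two-layer network with two hidden units does.

```python
import itertools

def step(z):
    return 1 if z > 0 else 0

def one_layer(x1, x2, w1, w2, b):
    return step(w1 * x1 + w2 * x2 + b)

def two_layer_xor(x1, x2):
    h_or = step(x1 + x2 - 0.5)       # hidden unit 1: logical OR
    h_and = step(x1 + x2 - 1.5)      # hidden unit 2: logical AND
    return step(h_or - h_and - 0.5)  # output: OR and not AND

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_targets = [0, 1, 1, 0]

# Search a coarse weight grid for a single threshold unit that computes XOR.
values = [v / 2 for v in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
solutions = [
    (w1, w2, b)
    for w1, w2, b in itertools.product(values, repeat=3)
    if [one_layer(x1, x2, w1, w2, b) for x1, x2 in inputs] == xor_targets
]

print("single-unit solutions found:", len(solutions))  # prints 0
print("two-layer outputs:", [two_layer_xor(x1, x2) for x1, x2 in inputs])  # prints [0, 1, 1, 0]
```

The empty search result is not an artefact of the grid: no hyperplane separates the XOR classes, so no choice of weights works for a single unit.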


• Backpropagation does not work well for more than 2 layers
  ◦ Non-convex optimization function
  ◦ Uses only local gradient information
  ◦ Depends on initialisation
  ◦ Gets trapped in local minima
  ◦ Generalisation is poor
  ◦ Cumulative backpropagated error signals either shrink rapidly or grow out of bounds (exponentially) (Hochreiter, 1991)
• Severity increases with the number of layers
• Focus shifted to convex optimization problems (e.g., SVM)
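The shrinking or exploding error signal can be sketched with a deliberately simplified model: a chain with one sigmoid unit per layer, the same weight in every layer, and a fixed pre-activation of 0. These simplifications are assumptions made for illustration. Each layer then multiplies the backpropagated signal by |w| · σ'(z), so the signal scales exponentially with depth.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backpropagated_signal(weight, depth, pre_activation=0.0):
    """Magnitude of an error signal after passing back through `depth` layers
    of a chain with one sigmoid unit per layer and the same weight everywhere."""
    s = sigmoid(pre_activation)
    local_derivative = s * (1.0 - s)  # at most 0.25 for the logistic sigmoid
    factor = abs(weight) * local_derivative
    return factor ** depth

for depth in (1, 5, 10, 20):
    small = backpropagated_signal(weight=1.0, depth=depth)
    large = backpropagated_signal(weight=8.0, depth=depth)
    print(f"depth {depth:2d}: |w|=1 -> {small:.3e}   |w|=8 -> {large:.3e}")
```

With a per-layer factor of 0.25 the signal vanishes, with a factor of 2 it explodes; only a factor close to 1 keeps it stable across many layers.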
Deep Learning


Deep Learning Approaches 


Overview of the most common techniques 
Deep Learning Definition


Definition of Deep Learning 


• Several definitions exist
• Two key aspects:
  ◦ models consisting of multiple layers or stages of nonlinear information processing
  ◦ methods for supervised or unsupervised learning of feature representations at successively higher, more abstract layers
• Deep learning architectures originated from, but are not limited to, artificial neural networks
• Contrasted with conventional shallow learning approaches
• Not to be confused with deep learning in educational psychology:
  ◦ "Deep learning describes an approach to learning that is characterized by active engagement, intrinsic motivation, and a personal search for meaning."


Example 


[Figure: hierarchy of representations, from bottom to top]
• raw input vector representation: raw pixel values
• slightly higher level representation: edges, shapes, etc.
• very high level representation: objects

Y. Bengio (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1-127.


Deep Learning vs. Shallow Learning 


• When does shallow learning end and deep learning begin?
• What is the depth of a machine learning algorithm?
• Credit assignment path (CAP): chain of causal links between input and output
• Depth: length of the CAP starting at the first modifiable link
• Examples:
  ◦ Feed-forward network: depth = number of layers
  ◦ Network with fixed random weights: depth = 0
  ◦ Network where only the output weights are trained (e.g., Echo State Network): depth = 1
  ◦ Recurrent neural network: depth = length of input (potentially unlimited)
• Deep Learning: depth > 2; Very Deep Learning: depth > 10
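The depth definition can be made concrete for a plain feed-forward chain. In the sketch below, the function name `cap_depth` and the encoding of a network as a list of "is this layer trained?" flags are assumptions made for illustration.

```python
def cap_depth(trainable):
    """Depth = number of links from the first modifiable one up to the output.

    `trainable` lists, from input to output, whether each layer's weights
    are modified by learning.
    """
    for index, is_trainable in enumerate(trainable):
        if is_trainable:
            return len(trainable) - index
    return 0  # nothing is learned

print(cap_depth([True, True, True]))     # feed-forward network with 3 trained layers -> 3
print(cap_depth([False, False, False]))  # fixed random weights -> 0
print(cap_depth([False, False, True]))   # only the output weights are trained -> 1
```

For a recurrent network the same count applies to the network unrolled over time, which is why the depth grows with the length of the input.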
Deep Learning History


History 


• The concept of deep learning originated from artificial neural network research
• The deep architecture of the human brain is a major inspiration: it successfully incorporates learning and information processing on multiple layers
• However, training ANNs with more than two hidden layers yielded poor results
• Breakthrough in 2006 (Hinton et al.): Deep Belief Networks (DBN)
• Principle: training of intermediate representation levels using unsupervised learning, which can be performed locally at each level
Deep Learning Approaches


Deep Belief Network (DBN) 


• A probabilistic, generative model composed of multiple simple learning modules that make up each layer
• Typically, these learning modules are Restricted Boltzmann Machines (RBMs)
• The top two layers have symmetric connections between them. The lower layers receive top-down connections from the layer above
• Greedy layer-wise training: each layer is successively trained on the output of the previous layer
• Can be used for pre-training a network, followed by fine-tuning via backpropagation

[Figure: DBN layer stack, bottom to top: visible layer, hidden layer 1, hidden layer 2, ...]


Restricted Boltzmann Machine (RBM) 


• Stochastic artificial neural network forming a bipartite graph
• The network learns a representation of the training data presented to the visible units
• The hidden units model statistical dependencies between the visible units
• Try to optimise the weights so that the likelihood of the data is maximised

[Figure: bipartite graph of hidden units and visible units]

$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big)$$
$$P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j h_j w_{ij}\Big)$$


• Activations in one layer are conditionally independent given the activations in the other layer
• → efficient training algorithm (Contrastive Divergence):
  ◦ From a training sample $v$, compute the probabilities of the hidden units and sample a hidden activation vector $h$ ($v h^\top$: positive gradient).
  ◦ From $h$, sample a reconstruction $v'$ of the visible units, then resample the hidden activations $h'$ from this (Gibbs sampling; $v' h'^\top$: negative gradient).
  ◦ Update the weights with $\Delta w_{ij} = \epsilon\,(v_i h_j - v'_i h'_j)$, i.e. $\Delta W = \epsilon\,(v h^\top - v' h'^\top)$.
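The three steps above can be sketched as a single contrastive-divergence update (CD-1) for a tiny binary RBM. The layer sizes, learning rate, training patterns and the bias updates are assumptions added for illustration; the slides only specify the weight update.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sample(probabilities, rng):
    return [1 if rng.random() < p else 0 for p in probabilities]

def hidden_probs(v, W, b):
    # P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * w_ij)
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def visible_probs(h, W, a):
    # P(v_i = 1 | h) = sigmoid(a_i + sum_j h_j * w_ij)
    return [sigmoid(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(a))]

def cd1_update(v, W, a, b, epsilon, rng):
    """One CD-1 step on a single binary sample v (updates W, a, b in place)."""
    h = sample(hidden_probs(v, W, b), rng)               # positive phase
    v_recon = sample(visible_probs(h, W, a), rng)        # reconstruction v'
    h_recon = sample(hidden_probs(v_recon, W, b), rng)   # resampled h'
    for i in range(len(v)):
        for j in range(len(h)):
            W[i][j] += epsilon * (v[i] * h[j] - v_recon[i] * h_recon[j])
    for i in range(len(v)):          # bias updates: an assumption, not on the slide
        a[i] += epsilon * (v[i] - v_recon[i])
    for j in range(len(h)):
        b[j] += epsilon * (h[j] - h_recon[j])

rng = random.Random(0)
n_visible, n_hidden = 6, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
a = [0.0] * n_visible
b = [0.0] * n_hidden
patterns = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]  # made-up training data

for _ in range(2000):
    cd1_update(rng.choice(patterns), W, a, b, epsilon=0.1, rng=rng)

for p in patterns:
    # Mean-field reconstruction: propagate probabilities instead of samples.
    recon = visible_probs(hidden_probs(p, W, b), W, a)
    print(p, [round(x, 2) for x in recon])
```

Because the units within a layer are conditionally independent, each phase is a single pass over the weight matrix, which is what makes the procedure cheap.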


Autoencoder 


• Feed-forward network trained to replicate its input (the input is the target signal for training) → unsupervised method
• Objective: minimize some form of reconstruction error
• Forces the network to learn a (compressed) representation of the input in the hidden layers
  ◦ For 1 hidden layer with k linear units, the hidden neurons span the subspace of the first k principal components
  ◦ With non-linear units, more complex representations can be learned
  ◦ With stochastic units, corrupted input can be cleaned → denoising autoencoder


[Figure: autoencoder layers, bottom to top: input → (encode) → hidden → (decode) → output]

• The dimensionality of the hidden layer can be smaller or larger than that of the input/output
  ◦ Smaller: yields a compressed representation
  ◦ Larger: results in a mapping to a higher-dimensional feature space
• Typically trained with a form of stochastic gradient descent
• Deep autoencoder: number of hidden layers > 1
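As a minimal sketch of the idea, the following trains the smallest possible linear autoencoder (2 inputs, 1 hidden unit, 2 outputs) with stochastic gradient descent on the squared reconstruction error. The data set, initial weights and learning rate are assumptions chosen for illustration.

```python
import random

def reconstruct(x, enc, dec):
    code = sum(e * xi for e, xi in zip(enc, x))  # encode: 2 inputs -> 1 linear hidden unit
    return code, [d * code for d in dec]         # decode: 1 hidden unit -> 2 outputs

def mean_error(data, enc, dec):
    total = 0.0
    for x in data:
        _, r = reconstruct(x, enc, dec)
        total += sum((ri - xi) ** 2 for ri, xi in zip(r, x))
    return total / len(data)

rng = random.Random(1)
# Made-up data: points scattered closely around the line x2 = 2 * x1.
data = []
for _ in range(200):
    t = rng.uniform(-1.0, 1.0)
    data.append([t + rng.gauss(0.0, 0.05), 2.0 * t + rng.gauss(0.0, 0.05)])

enc = [0.1, 0.1]
dec = [0.1, 0.1]
learning_rate = 0.01
initial_error = mean_error(data, enc, dec)

for _ in range(50):
    rng.shuffle(data)
    for x in data:
        code, r = reconstruct(x, enc, dec)
        residual = [ri - xi for ri, xi in zip(r, x)]  # the input is its own target
        grad_code = sum(2.0 * res * d for res, d in zip(residual, dec))
        dec = [d - learning_rate * 2.0 * res * code for d, res in zip(dec, residual)]
        enc = [e - learning_rate * grad_code * xi for e, xi in zip(enc, x)]

final_error = mean_error(data, enc, dec)
print(f"reconstruction error: {initial_error:.4f} -> {final_error:.4f}")
```

Since the hidden layer is smaller than the input, the network has to compress: with one linear unit it can only keep the direction along which the data varies most.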


Stacked Autoencoder 


[Figure: layer stack, bottom to top: visible layer (observed), hidden layer 1, hidden layer 2, hidden layer 3; encode/decode between adjacent layers]

• An autoencoder can be used as the learning module within a Deep Belief Network → stacked autoencoders
  1. Train the first layer as an autoencoder to minimize some form of reconstruction error of the raw input
  2. The hidden units' outputs (i.e., the codes) of the autoencoder are now used as input for another layer, also trained to be an autoencoder
  3. Repeat (2) until the desired number of additional layers is reached
• Can also be used for pre-training a network, followed by fine-tuning via supervised learning
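The greedy layer-wise procedure can be sketched as a loop around a single-layer trainer. The tanh encoder, linear decoder, absence of biases, layer sizes and random input data below are all assumptions made to keep the example small; the supervised fine-tuning step is not shown.

```python
import math
import random

def encode(x, W):
    # W[j][i]: weight from input i to hidden unit j
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def train_autoencoder(data, n_hidden, rng, epochs=30, lr=0.05):
    """Train one tanh-encoder / linear-decoder autoencoder with SGD; return the encoder weights."""
    n_in = len(data[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]  # encoder
    V = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]  # decoder
    for _ in range(epochs):
        for x in data:
            h = encode(x, W)
            r = [sum(v * hj for v, hj in zip(row, h)) for row in V]
            err = [ri - xi for ri, xi in zip(r, x)]
            # Backpropagate the squared reconstruction error through the decoder and tanh.
            grad_h = [sum(err[i] * V[i][j] for i in range(n_in)) * (1.0 - h[j] ** 2)
                      for j in range(n_hidden)]
            for i in range(n_in):
                for j in range(n_hidden):
                    V[i][j] -= lr * err[i] * h[j]
            for j in range(n_hidden):
                for i in range(n_in):
                    W[j][i] -= lr * grad_h[j] * x[i]
    return W

def train_stack(data, layer_sizes, rng):
    """Greedy layer-wise training: each layer is trained on the codes of the previous one."""
    encoders, current = [], data
    for n_hidden in layer_sizes:
        W = train_autoencoder(current, n_hidden, rng)
        encoders.append(W)
        current = [encode(x, W) for x in current]  # codes become the next layer's input
    return encoders, current

rng = random.Random(0)
raw = [[rng.uniform(-1.0, 1.0) for _ in range(6)] for _ in range(50)]  # made-up inputs
encoders, top_codes = train_stack(raw, layer_sizes=[4, 3, 2], rng=rng)
print([(len(W), len(W[0])) for W in encoders])  # (hidden units, inputs) per layer
```

Each call to `train_autoencoder` only ever sees one layer, so the optimisation problem stays shallow even though the resulting stack is deep.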


Convolutional neural network (CNN) 


• Before DBNs, supervised deep neural networks were difficult to train, with one exception: convolutional neural networks (CNNs)
• Inspired by biological processes in the visual cortex

[Figure: input image with receptive field (RF) size and pool size]

• Topological structure: neurons are arranged in filter maps that compute the same features for different parts of the input


• Typical CNNs have 5-7 layers
• A CNN for handwritten digit recognition (LeCun et al., 1998):

[Figure: convolutions → subsampling → convolutions → subsampling → full connection]

• Reasons why standard gradient descent methods are tractable for CNNs:
  ◦ Sparse connectivity: neurons receive input only from a local receptive field (RF)
  ◦ Shared weights: all neurons of a filter map compute the same function, each on its own RF
  ◦ Pooling: predefined function instead of learnt weights for some layers, e.g. max
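The three properties can be seen in a few lines of plain Python. The image and the edge-detecting kernel below are made up for illustration, and the kernel is fixed by hand here, whereas a CNN would learn it.

```python
def convolve2d(image, kernel):
    """'Valid' 2-D cross-correlation: one shared kernel slides over all receptive fields."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(feature_map, size):
    """Non-overlapping max pooling: a fixed function with no learnt weights."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[max(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]

# Made-up 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1],
                 [-1, 1]]  # 4 shared weights, reused at every position

feature_map = convolve2d(image, vertical_edge)  # 4x4 map
pooled = max_pool(feature_map, size=2)          # 2x2 map
print(feature_map)  # [[0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0]]
print(pooled)       # [[2, 0], [2, 0]]
```

The whole 4x4 filter map is produced by only 4 weights, compared with 25 x 16 = 400 weights for a fully connected layer of the same size, which is why the number of trainable parameters stays small.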


Deep stacking network (DSN) 


• Simple classifiers are stacked on top of each other to learn a complex classifier
  ◦ e.g., CRFs, two-layer networks
• Originally designed for scalability: simple classifiers can be efficiently trained (convex optimization → "deep convex network")
• Features for a classifier at a higher level are a concatenation of the classifier outputs of lower modules and the raw input features
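The feature concatenation in the last point can be sketched as follows. Only the wiring between modules is shown; `toy_module` is a hypothetical placeholder for a trained simple classifier, and all names are assumptions.

```python
def module_input(raw_features, lower_outputs):
    """Input of a module: the raw features plus the outputs of all lower modules."""
    features = list(raw_features)
    for output in lower_outputs:
        features.extend(output)
    return features

def run_stack(raw_features, modules):
    outputs, input_sizes = [], []
    for module in modules:
        features = module_input(raw_features, outputs)
        input_sizes.append(len(features))
        outputs.append(module(features))
    return outputs, input_sizes

def toy_module(features):
    # Placeholder "classifier" (hypothetical): returns two class scores.
    total = sum(features)
    return [total, -total]

outputs, input_sizes = run_stack([0.5, 1.0, -0.25], [toy_module] * 3)
print(input_sizes)  # [3, 5, 7]: each module sees 2 more features than the one below
```

Because every module still receives the raw input, each one can be trained as an ordinary simple classifier on its own enlarged feature vector.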


Recursive Neural Tensor Network (RNTN) 


• Tree structure with a neural network at each node
• Used in natural language processing, e.g., for sentiment detection
• Socher et al., 2013: parse sentences into a binary tree, and at each


Deep learning with textual data 


• Text has to be transformed into real-valued vectors that deep learning algorithms can process
• Word2Vec: efficient algorithms developed by Google (https://code.google.com/p/word2vec/)
• Word2Vec itself is not deep learning (it uses shallow ML methods)
• Given a text, it automatically learns relationships between words based on their context
• Each word is represented by a vector in a space where related words are close to each other, i.e. a word embedding
• Word vectors can be used as features in many natural language processing and machine learning applications


• Interesting properties of Word2Vec vectors:
  ◦ vector('Paris') - vector('France') + vector('Italy') ≈ vector('Rome')
  ◦ vector('king') - vector('man') + vector('woman') ≈ vector('queen')
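The arithmetic behind such analogies is a nearest-neighbour search around the combined vector. The sketch below uses hand-made 3-dimensional vectors, which are purely hypothetical and not trained Word2Vec embeddings, to show the mechanics.

```python
import math

# Hand-made 3-D vectors (hypothetical, NOT trained Word2Vec embeddings).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.3, 0.3],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))

print(analogy("king", "man", "woman"))  # queen
```

With real embeddings the result vector rarely coincides with a word vector, so the relation holds in the sense of "nearest word", not as an exact equality.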










• Training is performed via a two-layer neural network (hierarchical softmax or negative sampling)
• Input (word context) is represented as continuous bag of words or skip-grams
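The two input representations differ in what is predicted from what. The sketch below only generates the training examples from a token sequence; the function names, the example sentence and the window size are assumptions for illustration, and the network training itself is not shown.

```python
def skip_gram_pairs(tokens, window):
    """(centre word, context word) pairs: predict each context word from the centre."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """(context words, centre word) examples: predict the centre from its context."""
    examples = []
    for i, centre in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, centre))
    return examples

tokens = "the cat sat on the mat".split()
print(skip_gram_pairs(tokens, window=1)[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cbow_examples(tokens, window=1)[:2])
# [(['cat'], 'the'), (['the', 'sat'], 'cat')]
```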


Categorization of Deep Learning approaches 


• Deep networks for unsupervised learning
  ◦ e.g., Restricted Boltzmann Machines, Deep Belief Networks, autoencoders, Deep Boltzmann Machines, ...
• Deep networks for supervised learning
  ◦ e.g., Convolutional Neural Networks, Deep Stacking Networks, ...
• Hybrid deep networks: make use of both unsupervised and supervised learning (e.g., "pre-training")
  ◦ e.g., pre-training a Deep Belief Network composed of Restricted Boltzmann Machines


Alternative Deep Learning Architectures 


Deep learning is not limited to neural networks 


• Stacked SVMs with random projections
  Vinyals, Jia, Deng, & Darrell. Learning with Recursive Perceptual Representations.
  http://books.nips.cc/papers/files/nips25/NIPS2012_1290.pdf
• Sum-product networks
  Gens & Domingos. Discriminative Learning of Sum-Product Networks.
  http://books.nips.cc/papers/files/nips25/NIPS2012_1484.pdf


How to choose the right network? 


Data Sector | Use Case | Input | Transform | Neural Net
Text | Sentiment analysis | Word vector | Gaussian Rectified | RNTN or DBN (with moving window)
Text | Named-entity recognition | Word vector | Gaussian Rectified | RNTN or DBN (with moving window)
Text | Part-of-speech tagging | Word vector | Gaussian Rectified | RNTN or DBN (with moving window)
Text | Semantic-role labeling | Word vector | Gaussian Rectified | RNTN or DBN (with moving window)
Document | Topic modeling / semantic hashing (unsupervised) | Word count probability | Can be Binary | Deep Autoencoder (wrapping a DBN or SDA)
Document | Document classification (supervised) | TF-IDF (or word count prob) | Binary | Deep-belief network, Stacked Denoising Autoencoder
Image | Image recognition | Binary | Binary (visible and hidden) | Deep-belief network
Image | Image recognition | Continuous | Gaussian Rectified | Deep-belief network
Image | Multi-object recognition | | | Convolutional Net, RNTN (image vectorization forthcoming)
Image | Image search / semantic hashing | | Gaussian Rectified | Deep Autoencoder (wrapping a DBN)
Sound | Voice recognition | | Gaussian Rectified | Recurrent Net; moving window for DBN or ConvNet
Time Series | Predictive analytics | | Gaussian Rectified | Recurrent Net; moving window for DBN or ConvNet

http://deeplearning4j.org/neuralnetworktable.html


Limitations of Deep Learning 


Limitations 
• A team at Google made an interesting finding
• Small changes in the input yield a big, "unexpected" change in the output
• [Figure] The left images are labelled correctly, the right images are misclassified; the image in the centre shows the difference between the images
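Why many tiny changes can add up to a large change in the output can be illustrated with a linear classifier in a high-dimensional input space. This is a toy construction with made-up weights and inputs, not the method or the networks used in the finding above.

```python
n = 1000  # input dimension (think: number of pixels)
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]  # made-up linear classifier

def sign(value):
    return 1.0 if value > 0 else -1.0

def score(x):
    return sum(w * xi for w, xi in zip(weights, x))

def label(x):
    return 1 if score(x) > 0 else 0

# Original input: every component is close to 0.5 and the score is positive.
x = [0.5 + 0.01 * sign(w) for w in weights]

# Perturbation: move every component by only 0.02 against the sign of its weight.
epsilon = 0.02
x_adv = [xi - epsilon * sign(w) for xi, w in zip(x, weights)]

max_change = max(abs(p - q) for p, q in zip(x, x_adv))
print(f"largest change per component: {max_change:.3f}")
print(f"score {score(x):.1f} -> {score(x_adv):.1f}, label {label(x)} -> {label(x_adv)}")
```

Each component moves by at most 0.02, but the score shifts by epsilon times the sum of the absolute weights (0.02 x 1000 = 20), which is enough to flip the label.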


Available Software Toolkits 


Available toolkits to get started
• Theano
• Torch
• deeplearning4j
• 0xdata / H2O


Resources 


• Y. Bengio (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1-127.
• L. Deng and D. Yu (2014). Deep Learning: Methods and Applications. Foundations and Trends in Signal Processing, 7(3-4), 197-387.
• J. Schmidhuber (2014). Deep Learning in Neural Networks: An Overview. http://arxiv.org/abs/1404.7828.
• http://deeplearning.net/
• http://en.wikipedia.org/wiki/Deep_learning
• http://cl.naist.jp/~kevinduh/a/deep2014/
• etc.


The End 


Next: Presentation: Planned Approach 